Spatial-Channel Attention Multi-sensor Fusion Based on Bird's-Eye View
JI Yuzhe1, CHEN Yijie2, YANG Liuqing1,2, ZHENG Xinhu2
1. Internet of Things Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511455;
2. Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511455
Object perception based on the bird's-eye view (BEV) is a popular research topic, but studies on multi-sensor fusion for BEV remain insufficient. Therefore, a multi-sensor fusion module based on spatial-channel attention is proposed. Spatial errors between sensors are effectively corrected by applying local attention mechanisms to the features of different modalities. Through transposed attention operations, image and point cloud data are fully integrated, resolving the semantic heterogeneity between modalities. Consequently, the fused BEV features achieve more comprehensive and accurate perception by effectively combining the unique information of each sensor without introducing spatial misalignment, as sketched below. Experiments on the nuScenes dataset and extensive ablation studies show that the proposed fusion module effectively improves object detection accuracy. Visualization results demonstrate that the fused features capture more complete and accurate information, especially for distant object detection.
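The abstract does not specify the module in code, but the two mechanisms it names can be sketched concretely. The following is a minimal PyTorch sketch, not the authors' implementation: local windowed attention lets each LiDAR BEV location attend over a small neighborhood of the camera BEV feature to absorb residual calibration error, and a transposed (channel-wise) attention in the spirit of cross-covariance attention (XCiT) then mixes the semantics of the concatenated features. The class name, window size, and channel widths are illustrative assumptions; both inputs are assumed to be BEV maps of identical spatial size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialChannelFusion(nn.Module):
    """Illustrative sketch of spatial-channel attention fusion.

    Hypothetical design, not the paper's released code: local spatial
    attention re-aligns the camera BEV feature to the LiDAR BEV grid,
    then transposed (channel) attention fuses the two modalities.
    """

    def __init__(self, c_img, c_pts, c_out, window=7):
        super().__init__()
        self.window = window
        # Project both modalities to a common channel width.
        self.proj_img = nn.Conv2d(c_img, c_out, 1)
        self.proj_pts = nn.Conv2d(c_pts, c_out, 1)
        # Local spatial attention: LiDAR queries, camera keys/values.
        self.q = nn.Conv2d(c_out, c_out, 1)
        self.k = nn.Conv2d(c_out, c_out, 1)
        self.v = nn.Conv2d(c_out, c_out, 1)
        # Transposed attention over the concatenated features.
        self.norm = nn.LayerNorm(2 * c_out)
        self.qkv = nn.Linear(2 * c_out, 6 * c_out)
        self.out = nn.Conv2d(2 * c_out, c_out, 1)

    def forward(self, f_img, f_pts):
        # f_img: (B, c_img, H, W) camera BEV; f_pts: (B, c_pts, H, W) LiDAR BEV.
        fi, fp = self.proj_img(f_img), self.proj_pts(f_pts)
        B, C, H, W = fi.shape
        w = self.window
        # Unfold camera keys/values into w*w local neighborhoods, so each
        # LiDAR location can search nearby for its misaligned counterpart.
        k = F.unfold(self.k(fi), w, padding=w // 2).view(B, C, w * w, H * W)
        v = F.unfold(self.v(fi), w, padding=w // 2).view(B, C, w * w, H * W)
        q = self.q(fp).view(B, C, 1, H * W)
        attn = (q * k).sum(1, keepdim=True) / C ** 0.5   # (B, 1, w*w, H*W)
        attn = attn.softmax(dim=2)                       # over the neighborhood
        aligned = (attn * v).sum(2).view(B, C, H, W)     # camera feature on LiDAR grid
        x = torch.cat([aligned, fp], dim=1)              # (B, 2C, H, W)
        # Transposed attention: a (2C x 2C) channel-to-channel map instead of
        # token-to-token attention. XCiT adds a learnable temperature, omitted here.
        t = self.norm(x.flatten(2).transpose(1, 2))      # (B, HW, 2C)
        q2, k2, v2 = self.qkv(t).chunk(3, dim=-1)        # each (B, HW, 2C)
        q2 = F.normalize(q2, dim=1)                      # L2-normalize along tokens
        k2 = F.normalize(k2, dim=1)
        ca = (q2.transpose(1, 2) @ k2).softmax(-1)       # (B, 2C, 2C)
        t = (ca @ v2.transpose(1, 2)).view(B, 2 * C, H, W)
        return self.out(x + t)                           # fused BEV feature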
```

As a usage illustration under the same assumptions, `SpatialChannelFusion(80, 256, 128)` applied to camera and LiDAR BEV maps of shapes (4, 80, 180, 180) and (4, 256, 180, 180) returns a fused (4, 128, 180, 180) map; the channel widths and the 7x7 window are arbitrary placeholders, and the window size bounds how much cross-sensor misalignment the spatial attention can correct.